[issue-6019] [P SDK] Fix cache tokens not tracked in Bedrock Claude streaming#6038
Closed
ollieagent[bot] wants to merge 5181 commits into
Co-authored-by: Andres Cruz <andresc@comet.com>
…t-scoped operations (#5694) * [OPIK-4932] [OPIK-4937] [BE] Add project_id to experiments table and support project scoping Add project_id column with minmax index to ClickHouse experiments table. Update ExperimentService, ExperimentDAO, and ExperimentsResource to support project_id on create (via projectId or projectName) and filtering on list. * Revision 2: Make resolveProjectId reactive in experiment creation chain Chain resolveProjectId as a flatMap in the reactive pipeline instead of calling it synchronously, avoiding blocking the reactive chain. * Revision 3: Move project_id filter into UNION branches and fix empty-string bug - Move project_id predicate from outer WHERE into each UNION branch for FIND and FIND_GROUPS queries, so aggregations are filtered early - Fix arrayConcat to not append '' when experiment project_id is null, which was incorrectly triggering the project_deleted predicate - Use agg.project_ids directly in aggregated branches since experiments_from_aggregates_final already has the project_id * [OPIK-4932] [BE] fix: use createPartialExperiment in DatasetsResourceTest Use experimentResourceClient.createPartialExperiment() instead of factory.manufacturePojo(Experiment.class) to avoid PODAM generating random projectId values that fail validation after project_id support was added to experiments. 
* [OPIK-4932] [BE] fix: capitalize SQL FROM keyword in ExperimentDAO * [OPIK-4932] [BE] fix: address PR comments — non-nullable project_id, alias refs, materialize index - Change project_id from Nullable(FixedString(36)) to FixedString(36) DEFAULT '' to avoid Nullable performance overhead in ClickHouse - Replace coalesce/isNull with if(notEmpty(...))/empty() for non-nullable column - Reference pre-computed combined_project_ids/project_ids aliases in WHERE clauses instead of repeating arrayConcat expressions - Add MATERIALIZE INDEX to populate idx_project_id on existing data - Remove unnecessary toString() calls since project_id is already a string * [OPIK-4932] [BE] fix: use String instead of FixedString(36) for project_id column * [OPIK-4932] [BE] fix: align migration with liquibase conventions - Group rollbacks at end of changeset - Remove inline comments between statements - Fix --rollback empty format (no semicolon) * [OPIK-4932] [BE] fix: rename index to idx_experiments_project_id and use GRANULARITY 1 * [NA] [BE] fix: rename migration 000068 to 000070 to avoid conflict with main * [OPIK-4932] [BE] test: fix createExperimentWithProjectName and createExperimentWithNewProjectName to include projectId in expected ---------
…d stream endpoints (#5713) * [OPIK-4934] [BE] feat: add project_name filter to dataset retrieve and stream endpoints * [OPIK-4934] [BE] test: fix compilation and add project_name filter tests - Fix positional DatasetIdentifier/DatasetItemStreamRequest constructors to use builder pattern after new fields were added - Add ProjectService mock to DatasetsResourceIntegrationTest constructor - Add 4 new integration tests covering project_name filter behavior: - getDatasetByIdentifier with valid project_name returns scoped dataset - getDatasetByIdentifier with non-existing project_name falls back gracefully - streamDatasetItems with valid project_name returns scoped items - streamDatasetItems with non-existing project_name falls back gracefully * [OPIK-4934] [BE] fix: address PR review comments - Rename resolveProjectName to resolveProjectIdByName for clarity - Only set resolved projectId when non-null to avoid clobbering an existing projectId on the request - Log only datasetName and projectId instead of full request object to avoid exposing user-supplied filter strings in logs * [OPIK-4934] [BE] refactor: add findProjectIdByName helper to ProjectService Extract the repeated projectService.findByNames(...).stream().findFirst().map(Project::id) pattern into a shared Optional<UUID> findProjectIdByName(workspaceId, projectName) method on ProjectService. DatasetsResource.resolveProjectIdByName() now delegates to it. 
* [OPIK-4934] [BE] fix: address PR review comments - Add @JsonIgnore to projectId in DatasetItemStreamRequest to prevent client deserialization of server-internal field - Guard resolveProjectIdByName in streamDatasetItems so client-supplied projectId is never clobbered by name resolution - Introduce DatasetCriteria overload on DatasetService.findByName for consistency with the find() API - Use DatasetCriteria in DatasetsResource.getDatasetByIdentifier - Add getDatasetByIdentifier/callGetDatasetByIdentifier to DatasetResourceClient test helper; refactor inline REST calls to use it - Use builder pattern for DatasetIdentifier in tests - Add streaming test covering project_id filter * fix(stream): use client projectId directly when present instead of null Use request.projectId() as-is when the client supplies it, falling back to resolveProjectIdByName only when projectId is absent. The previous form set resolvedProjectId to null on the else-branch, which happened to work because resolvedRequest fell back to the original request (which already carried projectId), but was misleading and fragile.
* [OPIK-4713] add new permission
* [OPIK-4713] add permission checks
* [OPIK-5044] [BE] feat: add P2 workspace permission annotations Add P2 scope permissions from the workspace permissions spec: - WORKSPACE_SETTINGS_CONFIGURE for workspace config upsert/delete - USER_ROLE_UPDATE (enum only, no endpoint yet) - AI_PROVIDER_UPDATE for LLM provider key create/update/delete - ANNOTATION_QUEUE_ANNOTATE for adding items to annotation queues Includes RequiredPermissionsTest for all annotated endpoints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(permissions): add permission denial tests for P2 endpoints Add tests verifying endpoints return 403 when the auth service denies the required permission. Covers all P2-annotated endpoints plus existing ANNOTATION_QUEUE_DELETE endpoints. Adds AuthTestUtils.mockTargetWorkspaceDenyPermission() helper and call* client methods for AnnotationQueuesResourceClient. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Andres Cruz <andresc@comet.com>
…CompareRedirect (#5733)
* e2e fix
* fix(tests): apply prettier formatting to OptimizationCompareRedirect.test.tsx
…lete system overview (#5735)
* [OPIK-5096] [DOCS] docs: update self-host architecture page with complete system overview
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(architecture): add updated draw.io architecture diagram
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Updated image
* docs(architecture): add ClickHouse and MySQL schema draw.io diagrams
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Updated images
* Optimised images with calibre/image-actions
---------
…ments (#5693) * [OPIK-4615] [DOCS] docs: update dashboard documentation for GA refinements Rewrite dashboards.mdx to reflect the OPIK-4615 dashboard overhaul: - Remove beta notice, templates section, and "Working with templates" - Add dashboard types (Multi-project/Experiments) and scope (Workspace/Insights) - Add Insights tab section (built-in Project Overview, custom views, views selector, auto-save) - Add Leaderboard widget documentation - Update widget docs (per-widget project selector, unified modal, filter/group) - Update saving behavior (workspace Save/Discard, Insights auto-save, built-in read-only) - Update date range filtering (decoupled from save state) - Replace old screenshots with new ones, remove unused images - Update production_monitoring.mdx to reference Insights tab Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Optimised images with calibre/image-actions * docs(dashboards): add experiment pages to Insights description Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs(dashboards): document breakdown fields per data source and aggregation toggle Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs(dashboards): rename project overview image to dashboard_example.png Fixes missing image reference in changelog. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…nt (#5724)
* delete env endpoint
* address comments
* remove unneeded validation
* OPIK-4449 initial
* OPIK-4449 experiment unified, removed FF
* OPIK-4449 execution policy + evaluators
* OPIK-4449 fix linter
* OPIK-4449 fix deps circular issue
* OPIK-4449 fix description column accessor and polish eval suite UI
Fix table column reading description from row.data.description instead of row.description. Also remove card shadows, reduce textarea min rows, and clean up editItem merge logic in draft store.
* OPIK-4449 fix toggling off item-level execution policy override
Stop stripping undefined values in the draft store so that execution_policy: undefined is preserved as an explicit "clear override" signal. Use the `in` operator in UI components to distinguish absent keys from explicit undefined, and emit clear_execution_policy flag in the save payload for the backend.
* OPIK-4449 rename remaining "dataset" labels to "evaluation suite" in UI
* OPIK-4449 generate descriptions and assertions when expanding eval suite items with AI
Enhance the "Expand with AI" feature to automatically generate item-level descriptions and LLM judge assertions for evaluation suite items. The prompt now requests _opik_description and _opik_evaluator_assertions magic keys in the generated JSON, which are extracted and converted to proper item fields before adding to the draft store. Suite-level evaluator assertions are included in the prompt context to avoid duplication. The generated samples preview dialog now shows descriptions and assertions inline.
* OPIK-4449 fix prettier lint in ExecutionPolicyCell
* OPIK-4449 experiment updates
* OPIK-4449 merge fixes
* refactor
* OPIK-4449 open full panel with all fields when creating eval suite item
Wire "Create evaluation suite item" button to create a draft item in the store and open EvaluationSuiteItemPanel instead of the legacy AddDatasetItemSidebar/AddEditDatasetItemDialog. The panel detects new items via draftStatus and adjusts UI accordingly (title, dropdown, navigation). Regular datasets keep the existing flow unchanged.
* [OPIK-4449] Rename internal evaluator references to assertion Rename evaluator-converters.ts to assertion-converters.ts and update all internal function names and imports to use assertion-centric naming. Rename OPIK_EVALUATOR_ASSERTIONS_FIELD constant to OPIK_ASSERTIONS_FIELD. API-facing field names (evaluators on DatasetItem/DatasetVersion) are preserved unchanged. * OPIK-4449 new Figma updates * Fix lint issues: remove unused useMemo import and fix prettier formatting * Fix prettier formatting in remaining eval suite files * OPIK-4449 refactoring * OPIK-4449 fix issues * OPIK-4449 experiment UI updates * OPIK-4449 pass rate * OPIK-4449 merge main * [OPIK-5030] [FE] Fix UI crash on unfinished eval suite experiments Guard pass_rate with isNumber() before assigning to scores object, preventing null/undefined from propagating into chart tick calculator. Harden useChartTickDefaultConfig to filter non-finite values and bail early in generateNiceTicks, avoiding infinite loop and RangeError. * OPIK-4449 fix issues * [OPIK-5039] [FE] Fix view evaluation item button and lint * OPIK-4449 add type for create modal * OPIK-4449 remove internal plan/design docs from branch * OPIK-4449 fix prettier formatting issues
…ling (#5720) * [NA] [SDK] feat: instrument Anthropic beta API and fix compaction billing - Patch `client.beta.messages.create` and `client.beta.messages.stream` in `track_anthropic` so beta API calls are traced like the standard API - Add `patch_sync/async_beta_message_stream_manager` in stream_patchers.py to handle `BetaMessageStreamManager`/`BetaAsyncMessageStreamManager` - Add `AnthropicUsage.get_billable_tokens()` that sums all `iterations` when compaction fires (top-level tokens exclude the compaction iteration) https://platform.claude.com/docs/en/build-with-claude/compaction#understanding-usage - Store raw `usage.iterations` in span metadata for visibility - Add integration tests for beta create/stream (sync and async) - Add unit tests for compaction iteration billing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(anthropic): include cache tokens per iteration when compaction+caching combined SDK types (BetaMessageIterationUsage, BetaCompactionIterationUsage) confirm that cache_creation_input_tokens and cache_read_input_tokens are always present on each iteration. Mirror the non-compaction path: add cache_read to prompt and cache_creation to completion for every iteration. Also fix test data to be consistent with the doc: top-level input/output_tokens reflect only the non-compaction iterations, not 45000/1234 which was inconsistent. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(anthropic): cache_creation_input_tokens is input cost not output total_input = input_tokens + cache_read_input_tokens + cache_creation_input_tokens https://platform.claude.com/docs/en/build-with-claude/prompt-caching#tracking-cache-performance Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(anthropic): address Baz review comments - Guard beta stream imports with try/except ImportError; set module-level sentinels to None so the rest of the module loads on older anthropic SDK versions - Use inline try/except ImportError for beta isinstance checks in _streams_handler, following the existing SDK pattern (see crewai patcher) - Fix logger calls: remove stray str(exception) arg with no %s placeholder - Narrow except Exception -> except AttributeError when patching beta methods Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
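The token accounting described in the commits above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK's `AnthropicUsage.get_billable_tokens()`: the function names and the plain-dict usage shape are hypothetical, and only the field names (`input_tokens`, `output_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`, `iterations`) come from the commit messages. It follows the corrected rule that cache creation counts as input, and sums over `iterations` when they are present because the top-level totals exclude the compaction iteration.

```python
from typing import Any, Dict


def total_input_tokens(usage: Dict[str, Any]) -> int:
    """Input-side tokens: uncached input + cache reads + cache writes."""
    return (
        (usage.get("input_tokens") or 0)
        + (usage.get("cache_read_input_tokens") or 0)
        + (usage.get("cache_creation_input_tokens") or 0)
    )


def billable_tokens(usage: Dict[str, Any]) -> Dict[str, int]:
    """Hypothetical helper: sum every iteration when compaction fired,
    otherwise fall back to the top-level usage fields."""
    # Assumption: `iterations` is a list of per-iteration usage dicts.
    iterations = usage.get("iterations") or [usage]
    prompt = sum(total_input_tokens(it) for it in iterations)
    completion = sum((it.get("output_tokens") or 0) for it in iterations)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }
```

Summing per iteration keeps the cache fields attached to the iteration that incurred them, which matters when compaction and caching are combined.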
… and schema flattening (#5707) * [NA] [SDK] fix: flatten LLMJudge JSON schema to avoid $ref indirection Inline all $defs/$ref references in the ResponseSchema JSON schema sent to providers. This reduces grammar-constrained decoding complexity for Anthropic models (especially Haiku) which frequently return incomplete JSON when schemas use $ref indirection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(llm-judge): add tenacity retry and parse validation Parse now validates result count and scoring_failed status, raising LLMJudgeParseError with partial results attached. The retry decorator on _generate_and_parse retries up to 3 times on parse failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(litellm): extract structured output from tool_calls when content is null litellm implements response_format via tool_use for Anthropic models. Under concurrent load ~5% of responses arrive with content=null and the JSON in tool_calls[0].function.arguments instead. Only extracts from tool calls named "json_tool_call" (litellm's structured output marker) to avoid interfering with real tool use responses. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(anthropic): add native AnthropicChatModel for evaluation Cherry-picked from alexkuzmik/native-anthropic-model. Adds a native Anthropic client that bypasses litellm for anthropic/ prefixed models, with proper param filtering and structured output via tool_use. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(anthropic): use messages.parse for native structured output Switch from tool_use workaround to Anthropic's native structured output API (messages.parse with output_format). The SDK handles schema transformation internally and returns JSON as text content, eliminating the tool_use indirection entirely. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(anthropic): add 60s timeout to Anthropic client connections Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix lint errors * feat(anthropic): add tracking for messages.parse structured output Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(llm-judge): address PR review feedback - Handle None input in ResponseSchema.parse to avoid TypeError crash - Remove `stream` from _SUPPORTED_PARAMS to prevent streaming responses in generate_string - Log LLMJudgeParseError with stack trace before returning partial results Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: address petrotiurin review feedback - Revert unintended package-lock.json change - Extract DEFAULT_MAX_TOKENS constant for easier discovery - Add depth limit (50) to _resolve_refs recursion to prevent loops Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(anthropic): add tracking for beta.messages.parse Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: remove unintended frontend and optimizer files from branch Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: remove optimizer reflection logs accidentally added in merge Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
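The schema flattening with a recursion depth limit, as described in the commits above, could look roughly like the sketch below. It is an illustration only: `inline_refs` and `MAX_DEPTH` are hypothetical names (the branch mentions `_resolve_refs` and a limit of 50, but its actual code is not shown here), only local `#/$defs/...` references are handled, and raising on overflow is an assumption about what the limit does.

```python
import copy
from typing import Any, Dict

MAX_DEPTH = 50  # mirrors the depth limit mentioned in the review fix


def inline_refs(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `schema` with local $defs/$ref indirection inlined."""
    defs = schema.get("$defs", {})

    def resolve(node: Any, depth: int) -> Any:
        if depth > MAX_DEPTH:
            # Assumption: bail out loudly on cyclic or very deep schemas.
            raise RecursionError("schema nesting exceeds MAX_DEPTH")
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/$defs/"):
                target = copy.deepcopy(defs[ref[len("#/$defs/"):]])
                # Keep keys that sat next to the $ref (e.g. a description).
                target.update({k: v for k, v in node.items() if k != "$ref"})
                return resolve(target, depth + 1)
            # Drop the $defs table itself; everything else is walked.
            return {
                k: resolve(v, depth + 1) for k, v in node.items() if k != "$defs"
            }
        if isinstance(node, list):
            return [resolve(item, depth + 1) for item in node]
        return node

    return resolve(schema, 0)
```

A self-referential definition never terminates on its own, so the depth counter is what turns that case into an error instead of unbounded recursion.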
…laceholder for API key entry (#5740)
* [OPIK-5097] [FE] chore: add v1/v2 scaffolding, docs, and agent rules Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [OPIK-5097] [FE] refactor: move components/ into ui/, shared/, v1/ structure Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…layer (#5727) * [OPIK-4934] [BE] refactor: move project name resolution into service layer Address andrescrz review feedback: resolveProjectIdByName logic belongs in the service, not the resource. - Add DatasetService.findByName(workspaceId, DatasetIdentifier, visibility) overload that resolves project name internally via ProjectService - Inject ProjectService into DatasetItemServiceImpl; move project name->id resolution inside getItems so the resource no longer needs it - Remove resolveProjectIdByName helper, ProjectService field and import from DatasetsResource - Fix DatasetsResourceIntegrationTest to match updated constructor * [OPIK-4934] [BE] refactor: address baz review — strict project validation and reactive resolution - DatasetService.findByName(identifier): throw NotFoundException when projectName is provided but no matching project exists, instead of silently falling back to a workspace-wide search - DatasetItemService.getItems: remove pre-reactive ProjectService call; delegate project resolution to DatasetService.findByName(DatasetIdentifier) inside Mono.fromCallable so blocking lookup runs on boundedElastic scheduler - Remove ProjectService injection from DatasetItemServiceImpl (no longer needed) * fix(tests): restore fallback and ignore ttft in trace assertions - DatasetService.findByName: revert orElseThrow to orElse(null) so that a non-existent project name falls back to searching without project scope, preserving the documented contract tested by getDatasetByIdentifier__whenNonExistingProjectName and streamDataItems__whenNonExistingProjectName - TraceAssertions: add ttft to IGNORED_FIELDS_TRACES to prevent flaky failures from Double precision loss when ClickHouse stores/returns the field (findWithImageTruncation parameterized tests were affected) * fix(tests): compare ttft with Double tolerance instead of ignoring Replace the blanket ttft ignore with a proper Double comparator. 
findWithImageTruncation now builds a RecursiveComparisonConfiguration with StatsUtils::compareDoubles (abs tolerance 1e-6) so the field value is still verified but floating-point ULP differences from ClickHouse storage are tolerated. Consistent with assertTraces() which already used this comparator for all Double fields. * Update GetTracesByProjectResourceTest.java
* [OPIK-4712] Add new permission
* [OPIK-4712] Hide create/clone dashboard CTAs
* [OPIK-4712] disable template editing
* [OPIK-4712] do not disable view switching
* [OPIK-4712] remove redundancies
* [OPIK-4712] remove redundancies
* [OPIK-4712] remove view permission check from insights
* [OPIK-4712] remove view permission check from insights
* [OPIK-4712] remove redundancies
* [OPIK-4449] [FE] revert UI labels from "Evaluation suites" back to "Datasets"
Revert user-facing text to show "Datasets" until evaluation suites feature is released. Hide the type selector in the create modal so it always creates a dataset.
* [OPIK-4449] [FE] fix: prettier formatting
…o prompt version endpoints (#5736) * [OPIK-4935] [BE] feat(api): add project_name and project_id scoping to prompt version endpoints Add project_name and project_id fields to POST /v1/private/prompts/versions/retrieve and POST /v1/private/prompts/versions, allowing callers to scope prompt lookups and creation to a specific project. - PromptVersionRetrieve: add optional project_name (filters lookup to given project) - CreatePromptVersion: add optional project_id (takes precedence) and project_name - PromptService: extract resolveProjectId() helper; use strict DAO lookup in retrievePromptVersion to prevent fallback to workspace-level when project_name is explicitly provided; validate project_id existence when supplied - PromptResourceTest: update all PromptVersionRetrieve call sites to use builder, add project_name scoping and validation tests Implements OPIK-4935 * fix(prompts): throw NotFoundException when project_name is provided but not found in retrieve * fix(prompt): restore workspace-level fallback in retrievePromptVersion Use the private findByName helper (which already handles the project → workspace fallback) instead of calling promptDAO.findByName directly, so retrieval behaves consistently with the creation path. * test(prompt): update retrieve fallback test to match workspace-wide fallback design When the project-level lookup misses, retrievePromptVersion falls back to a workspace-wide search. Updated the test to assert 200 (found via fallback) instead of 404. * test(prompt): add missing retrieve scenarios for project-scope fallback - workspace-level prompt found when projectName is specified (fallback) - 404 when projectName does not exist in the workspace * [OPIK-4981] [BE] Fall back to workspace-wide in retrievePromptVersion when project not found When projectName is provided but does not resolve to an existing project, pass null as projectId so the lookup falls through to workspace-wide search instead of throwing 404.
* initial
* pr comments
* pr comments
…riment and optimization queries (#5745)
* [OPIK-5150] [BE] perf: replace FINAL with ORDER BY/LIMIT 1 BY in experiment and optimization queries
Replace ClickHouse FINAL modifier with explicit ORDER BY DESC + LIMIT 1 BY pattern across ExperimentDAO and OptimizationDAO queries for improved query performance in optimization studio.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: address PR review - two-layer dedup pattern, remove redundant LIMIT 1 BY, narrow SELECTs
- Use dedup-then-filter subquery pattern: inner query deduplicates with immutable sort-key filters only, outer query applies mutable column filters (name, type, optimization_id, tags, etc.) to prevent phantom rows
- Remove LIMIT 1 BY from feedback_scores/authored_feedback_scores CTEs (redundant with downstream ROW_NUMBER, and missing author key would drop authored scores)
- Narrow SELECT * to specific columns in spans and experiment_aggregates helper CTEs where only a few columns flow downstream
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…ce fallback (#5744)
* [OPIK-4962] Add deprecation header handling to HTTPX client
* Implemented logic to log deprecation warnings for responses containing the X-Opik-Deprecation header, ensuring warnings appear only once per path during the session.
* Added related unit tests to validate behavior for scenarios with and without deprecation headers.
* Refactored HTTPX client to include warning tracking and integrate logging functionality.
* [OPIK-4962] Improve HTTPX client deprecation warning handling and fix tests
- Ensure deprecation warnings are logged once per method and path combination.
- Fix unit tests to validate warnings with updated message formatting.
- Add client closing to improve testing cleanup.
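Why the review asked for dedup-then-filter rather than filtering first can be shown with a small in-memory model. This is only a sketch of the semantics, not the DAO SQL: the row shape and the `id` / `last_updated_at` / `name` column names are assumptions chosen for illustration, and `latest_per_key` stands in for what `ORDER BY ... DESC LIMIT 1 BY id` achieves.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def latest_per_key(rows: List[Row], key: str = "id",
                   version: str = "last_updated_at") -> List[Row]:
    """Keep only the newest version of each row, one row per key."""
    latest: Dict[Any, Row] = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[version] > current[version]:
            latest[row[key]] = row
    return list(latest.values())


def dedup_then_filter(rows: List[Row],
                      predicate: Callable[[Row], bool]) -> List[Row]:
    """Inner step deduplicates; outer step filters on mutable columns."""
    return [row for row in latest_per_key(rows) if predicate(row)]


def filter_then_dedup(rows: List[Row],
                      predicate: Callable[[Row], bool]) -> List[Row]:
    """The problematic order: a stale version can satisfy the filter."""
    return latest_per_key([row for row in rows if predicate(row)])
```

With two versions of the same experiment where only the older one is named `"old"`, `filter_then_dedup` returns the stale version (a phantom row) while `dedup_then_filter` returns nothing, which is the behaviour the two-layer query aims for.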
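The once-per-method-and-path warning behaviour described above can be sketched without any HTTP client. The class and logger names below are hypothetical, a plain mapping stands in for the response headers, and only the `X-Opik-Deprecation` header name comes from the commit message; the real client integration is not shown here.

```python
import logging
from typing import Mapping, Set, Tuple

LOGGER = logging.getLogger("deprecation_sketch")
DEPRECATION_HEADER = "X-Opik-Deprecation"


class DeprecationWarner:
    """Logs a deprecation notice at most once per (method, path) pair."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def check(self, method: str, path: str, headers: Mapping[str, str]) -> bool:
        """Return True if a warning was logged for this response."""
        # Note: a plain dict lookup is case-sensitive; a real HTTP header
        # container would normally match header names case-insensitively.
        message = headers.get(DEPRECATION_HEADER)
        if message is None:
            return False
        key = (method.upper(), path)
        if key in self._seen:
            return False
        self._seen.add(key)
        LOGGER.warning("Deprecated endpoint %s %s: %s", key[0], path, message)
        return True
```

Keying on the method as well as the path means a deprecated `POST` still produces its own warning after a `GET` on the same path has already been reported.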
* [OPIK-4942] [BE] POC: Separate assertion_results table (Option E)
Demonstrates the architecture for storing assertion results in a dedicated
ClickHouse table instead of piggybacking on feedback_scores with
category_name='suite_assertion'.
Changes:
- New assertion_results ClickHouse table (migration 000070)
- AssertionResultDAO for writing assertion data to the new table
- FeedbackScoreDAO splits writes: assertions -> assertion_results, regular -> feedback_scores
- ExperimentItemDAO STREAM query adds assertion_results_per_trace CTE
- ExperimentItemMapper passes assertions_array to enrichWithAssertions
- AssertionResultMapper reads from dedicated column instead of partitioning feedback scores
Not included in this POC (would be needed for production):
- DatasetItemDAO/DatasetItemVersionDAO assertion CTE changes
- ExperimentAggregatesDAO pass rate aggregation from new table
- REST endpoint exclude_category_names cleanup
- Data migration for existing installations
- SDK changes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Fix test compilation: inline suite_assertion constant
The SUITE_ASSERTION_CATEGORY constant was removed from AssertionResultMapper
in the Option E refactor, but the test still referenced it.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add assertion_results_per_trace CTE to compare endpoint
- Add assertion_results_per_trace CTE to DatasetItemVersionDAO (both
has_aggregated and has_raw branches) — compare endpoint was using
DatasetItemVersionDAO, not DatasetItemDAO which had the CTE
- Add arp.assertions_array at tuple index 19 in both branches
- Remove group.size() <= 1 guard in AssertionResultMapper.computeRunSummaries()
so run summaries are emitted when a dataset item has 1 run per experiment
- Add assertion_scores_avg Map column to experiment_aggregates (migration 000071)
- Add AssertionScoreAverage API record
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Revert package-lock.json — unintentional change from lint hook
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Fix test compilation errors
- AssertionResultMapperTest: update enrichWithAssertions calls to use
(item, jsonString) signature; rewrite tests for assertion_results
table approach (no longer reads from feedbackScores); update
computeRunSummaries_singleRun test to reflect removed group.size()<=1 guard
- ExperimentsResourceTest: remove extra null arg from getFeedbackScoreNames
calls (leftover from older branch version of the method)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Bump migration numbers to 071 and 072
000070 conflicts with 000070_add_project_id_to_experiments.sql from main.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Use JsonUtils import in AssertionResultMapper
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Update test: runSummaries emitted for single-run suite experiments
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Remove misleading comment from runSummaries test
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Narrow catch to JsonProcessingException in AssertionResultMapper
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add assertionScores to EXPERIMENT_IGNORED_FIELDS in test
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add assertionScores assertions to PassRate tests + fix score routing
- Add .categoryName("suite_assertion") to PassRate.score() helper so
scores route to assertion_results table (required for pass rate SQL)
- Fix itemThreshold test: set per-item executionPolicy in createDatasetItems
instead of applyDatasetItemChanges to avoid version-2 row-ID mismatch
- Add assertionScores assertions to 4 tests: thenReturnPassRate (2/3),
multipleAssertions (scoreName1=1.0, scoreName2=0.5), passThresholdNotMet
(1/3), and itemThreshold (4/6)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Address PR review comments: AssertionStatus enum, JsonUtils, SQL cleanup
- Use JsonUtils.readValue() instead of getMapper().readValue() (comment #1)
- Replace explicit CAST with tuple() in SQL for type flexibility (comments #2, #3)
- Change passed column from UInt8 to Enum8('passed'=0,'failed'=1) (comment #4)
- Add AssertionStatus enum used end-to-end from DB to API response
- Update all SQL queries using toFloat64(passed) to toFloat64(passed = 'passed')
- Add project_id filter to assertion_results query in DatasetItemVersionDAO (comment #6)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Fix test compilation: use AssertionStatus enum instead of boolean assertions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Add AssertionResultService + fix toJSONString tuple serialization
Move assertion score routing from FeedbackScoreDAO to service layer via
dedicated AssertionResultService. Fix assertion_results query where
toJSONString(tuple(...)) produced arrays instead of objects — use CAST
with named Tuple type so toJSONString emits JSON objects matching
AssertionResultRow record.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Remove suite_assertion exclusion tests from Traces and Projects
suite_assertion scores now go to the separate assertion_results table,
so exclude_category_names filtering is no longer needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [OPIK-4942] [BE] Return boolean passed field in AssertionResult API response
Map AssertionStatus enum to boolean in AssertionResultMapper so
SDK/FE consumers receive passed: true/false instead of passed/failed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Andres Cruz <andresc@comet.com>
…ements (#5715)
* use config to setup LocalRunnerReaperJob
* create string RedissonClient bean
* Use reactive Redis client for nextJob long-poll
* refactor heartbeat
* add reaperMaxRunnersPerCycle
* address comments
* refactor local runner reaper
* moved polled job to active list in a transactional manner
* address comments
* fix potential race condition
* address comments
…ually loads (#5954) * [OPIK-4828] [FE] fix: prevent No Data flash and stale thread data in ThreadDetailsPanel - Remove keepPreviousData from useThreadById and useTracesList to avoid showing the previous thread's data when switching between threads - Change && to || in renderContent loading check so that a Loader is shown whenever either query is pending, not only when both are pending simultaneously (which caused No Data to appear if traces resolved first) * feat(db): add bloom filter skip index on traces.thread_id OPIK-4828: thread_id had no skip index, causing full partition scans when filtering traces by thread in the thread view. The bloom filter at 1% false positive rate allows ClickHouse to skip ~99% of irrelevant granules for thread_id lookups. * fix(threads): enable skip index on FINAL scan in findById query Add SETTINGS use_skip_indexes_if_final = 1 so the bloom filter on traces.thread_id is applied even when reading with FINAL, allowing ClickHouse to skip irrelevant granules during thread view loading. * [OPIK-4828] [BE] perf: remove FINAL from append-only feedback score tables in thread queries Remove FINAL modifier from feedback_scores and authored_feedback_scores reads across all thread DAO queries. authored_feedback_scores is append-only so FINAL is unnecessary overhead; feedback_scores reads are deduplicated via ROW_NUMBER() window function downstream so FINAL is also redundant. Also optimise SELECT_TRACES_THREAD_BY_ID with a two-phase CTE approach: narrow to matching trace IDs first (with FINAL), then read full rows without FINAL using LIMIT 1 BY for deduplication. Benchmarked ~5x faster on production for threads with large trace counts. 
* Fix issue * fix(threads): remove trailing whitespace in ThreadDAO SQL queries * [OPIK-4828] [BE] Address PR review: move use_skip_indexes_if_final to global config, fix redundant dedup * [OPIK-4828] [BE] Remove remaining per-query use_skip_indexes_if_final settings * Fix tests * Update and rename 000076_add_thread_id_skip_index_to_traces.sql to 000077_add_thread_id_skip_index_to_traces.sql
* init kpi cards; * finish kpi cards and graph; * refactor; * eslint issues; * baz review comments; * eslint issues; * revert endtime; --------- Co-authored-by: aadereiko <aliaksandr@comet.com>
… in runner (#5952) * [OPIK-5326] [SDK] feat: cast job input values to declared param types in runner Add input type casting to both the TypeScript and Python in-process runner loops so agent functions receive correctly-typed arguments regardless of how the server serialised the values in job.inputs. TypeScript: export castInputValue() from InProcessRunnerLoop, apply it in invokeAgent using each Param's declared type (boolean / number / string). Python: add cast_input_value() to in_process_loop, apply it in _execute_job for all keys that match a registered param (bool / int / float / str); keys such as opik_args that are not in params pass through unchanged. Both implementations follow the same pattern as typeHelpers.ts: primitives are cast natively, complex types (dict/list) are JSON-serialised as strings, and null/None passes through unchanged. Unit tests added for both SDKs using parametrisation to cover each type individually and a set of multi-param combination scenarios. Implements OPIK-5326 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [OPIK-5326] [SDK] refactor: reuse type_helpers deserialization in runner casting Address baz-reviewer feedback on PR #5952: TypeScript (comment #3009010633): castInputValue now delegates to deserializeValue() from typeHelpers.ts for string→boolean and string→number conversions. Adds a Number.isNaN guard so non-numeric strings (e.g. "abc") throw TypeError instead of silently passing NaN to the agent function. Python (comment #3009010639): extract_params now calls unwrap_optional() from type_helpers.py before extracting the type name, so Optional[int] annotations correctly store type="int" instead of the raw "typing.Optional[int]" string. cast_input_value is rewritten to delegate to backend_value_to_python_value() from type_helpers.py, unifying the conversion logic across AgentConfig and the runner. 
Python (comment #3009010648): renamed all test functions in test_cast_input_value.py to follow the repo convention test_WHAT__CASE_DESCRIPTION__EXPECTED_RESULT. Added tests for Optional[T] unwrapping in extract_params and for the new backend type name aliases ("boolean", "integer", "string"). Skipping comment #3009010644 (bool "1"→True): strict "true"/"false" only behaviour is intentional and mirrors the TypeScript SDK — the backend serialises booleans as true/false, not "1"/"yes". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [OPIK-5326] [SDK] refactor: move type_helpers to shared path type_helpers.py / typeHelpers.ts were under agent_config but are now used by the runner as well. Move them to a neutral location: Python: api_objects/agent_config/type_helpers.py → api_objects/type_helpers.py TypeScript: agent-config/typeHelpers.ts → typeHelpers.ts (opik package root) Update all import sites in both SDKs: agent_config internals (config, blueprint, base, AgentConfig, Blueprint, index), the runner (in_process_loop, registry), the client (Client.ts), and all corresponding test files. No logic changes. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [OPIK-5303] [SDK] refactor: use module-form import for type_helpers in in_process_loop Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [OPIK-5326] [SDK] refactor: standardise input casting around backend type names - Remove backend_type param from backend_value_to_python_value (was unused) - Raise TypeError for truncating int casts ("3.9") and bool→int coercion - registry.extract_params now emits backend type names (integer/boolean/string) so Param.type is consistent with what the server expects - cast_input_value delegates directly to backend_type_to_python_type; the dual Python/backend name lookup (type_name_to_python_type) is removed - Add _execute_job integration tests covering multi-typed params in both Python and TypeScript Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [NA] revert package-lock.json to main Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * test(runner): update extract_params assertions to backend type names * fix(runner): treat '1'/'yes' as truthy bool; report cast errors as failed jobs - Extend bool casting to accept "1" and "yes" as truthy values - Move input casting inside _execute_job's try/except so TypeError from invalid inputs is reported to the backend as a failed job instead of propagating uncaught Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * test(runner): use fake timers in typed-params TS test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
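The casting rules described in this commit series can be illustrated with a rough sketch. This is not the SDK's `cast_input_value`: the function name `cast_input_value_sketch`, the `float` branch, and the choice to accept only "false" as the falsy string are assumptions made for illustration. The other rules (None passes through, "1"/"yes" are truthy, truncating casts such as "3.9" and bool-to-int coercion raise `TypeError`, dicts and lists are JSON-serialised for string params) follow the commit messages above.

```python
import json
from typing import Any


def cast_input_value_sketch(value: Any, type_name: str) -> Any:
    """Hypothetical stand-in for the runner's casting step (not the SDK code)."""
    if value is None:
        return None  # null/None passes through unchanged

    if type_name == "boolean":
        if isinstance(value, bool):
            return value
        text = str(value).strip().lower()
        if text in ("true", "1", "yes"):
            return True
        if text == "false":  # assumption: only "false" is accepted as falsy
            return False
        raise TypeError(f"cannot cast {value!r} to boolean")

    if type_name == "integer":
        if isinstance(value, bool):
            raise TypeError("refusing bool -> integer coercion")
        if isinstance(value, int):
            return value
        try:
            # int("3.9") raises ValueError, so truncating casts are rejected
            return int(str(value).strip())
        except ValueError as exc:
            raise TypeError(f"cannot cast {value!r} to integer") from exc

    if type_name == "float":  # assumption: handled analogously to integer
        if isinstance(value, bool):
            raise TypeError("refusing bool -> float coercion")
        try:
            return float(value)
        except (TypeError, ValueError) as exc:
            raise TypeError(f"cannot cast {value!r} to float") from exc

    if type_name == "string":
        if isinstance(value, (dict, list)):
            return json.dumps(value)  # complex values become JSON strings
        return str(value)

    return value  # unknown type names pass through untouched
```

Raising `TypeError` for every bad input gives the caller a single exception type to catch and report as a failed job, which matches the intent of moving casting inside the `try/except` in `_execute_job`.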
…lude workspace (#6022) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Migration 000061_add_catch_up_columns_to_retention_rules was already executed in staging/prod. It was mistakenly renamed to 000062 and its index was changed in-place, which would break Liquibase checksums. Restore 000061 to its original form and create a new 000062 migration that drops the old index and recreates it with the corrected column order (catch_up_done, enabled, apply_to_past, catch_up_cursor).
…ed runners (#6024)

Online scoring E2E tests were consistently timing out (60s) on GitHub-hosted runners in post-merge CI, while passing on self-hosted runners. The scoring pipeline (rule activation -> trace creation -> LLM API call -> score storage) takes longer on resource-constrained GitHub-hosted runners.

- Increase test timeout from 60s to 120s
- Increase polling attempts from 15 to 25
- Increase page refresh wait from 2s to 3s

Co-authored-by: Andrei Căutișanu <andreicautisanu@ip-192-168-1-128.eu-west-1.compute.internal>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, and UX improvements (#6011) * [NA] [TS SDK] feat: indexed keys in LLMJudge schema, reasoning_effort, and UX improvements Port Python SDK PRs #5690 and #5677 to TypeScript SDK: - Use indexed keys (assertion_1, assertion_2) instead of assertion text as JSON schema property names for cross-provider compatibility (Anthropic, OpenAI character limits) - Refactor buildResponseSchema/parseResponse into ResponseSchema class - Add reasoningEffort option to LLMJudge (defaults to "low") - Add ---BEGIN/END--- delimiters around input/output in LLM judge prompt - Move dashboard link inside result box, remove "Uploading results" message Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ts-sdk): fromConfig reads description over name for cross-source compatibility Ensures UI-created LLM judge configs (where name="Correctness" but description="Whether the output is correct") deserialize correctly. Also fixes variables format to match Python SDK / backend convention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ts-sdk): always show experiment link, even without metrics The result box with the dashboard link was only displayed when metrics were present. Moved getUrl() to processResults so the link is always shown, fixing the evaluate.test.ts regression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ts-sdk): treat experiment dashboard link as best-effort Wrap experiment.getUrl() in try/catch so a missing dataset doesn't crash the evaluation results flow. The dashboard link is skipped gracefully if the URL cannot be resolved. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ts-sdk): mock createExperimentItems in evaluateWithVersion test The test was missing a mock for Experiment.insert's underlying API call, causing unhandled 401 rejections in CI after test teardown. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ts-sdk): address PR review — cache ResponseSchema, pass reasoningEffort, remove dead code - Cache ResponseSchema as instance field instead of recreating on every score()/toConfig() call - Pass reasoning_effort to generateProviderResponse so the LLM actually receives it at runtime - Remove unused assertions field from ResponseSchema class Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
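The indexed-key idea from this commit series can be sketched as follows. This is a hedged illustration in Python, not the TypeScript `ResponseSchema` class: the function names and the per-assertion fields (`passed`, `reason`) are assumptions. Only the `assertion_1`, `assertion_2`, ... key scheme comes from the commit message.

```python
import json
from typing import Any, Dict, List


def build_response_schema(assertions: List[str]) -> Dict[str, Any]:
    """Build a JSON schema keyed by assertion_<n> instead of assertion text."""
    properties: Dict[str, Any] = {}
    for index, text in enumerate(assertions, start=1):
        properties[f"assertion_{index}"] = {
            "type": "object",
            "description": text,  # the full text lives here, not in the key
            # Hypothetical result fields, chosen for illustration only.
            "properties": {
                "passed": {"type": "boolean"},
                "reason": {"type": "string"},
            },
            "required": ["passed", "reason"],
        }
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
    }


def parse_response(raw: str, assertions: List[str]) -> Dict[str, Dict[str, Any]]:
    """Map the indexed keys in a model response back to the assertion text."""
    payload = json.loads(raw)
    return {
        text: payload[f"assertion_{index}"]
        for index, text in enumerate(assertions, start=1)
    }
```

Short, predictable property names avoid provider limits on key length and allowed characters, while the mapping back to the original text stays a simple positional lookup.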
…ng (#6021) * [OPIK-5452] [SDK] feat: warn when runner process exits without blocking When a user's script exits without a blocking call (e.g., no server framework), the runner never processes jobs. This adds detection via signal tracking — if the process exits cleanly (no SIGTERM/SIGINT), a warning is printed advising the user to use a server framework like uvicorn/Flask (Python) or express/fastify (TypeScript). Implements OPIK-5452 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(runner): address PR review comments - Consolidate test imports to use only module-level import - Make install_signal_handlers return bool; skip atexit when handlers fail Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
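A minimal sketch of the signal-tracking approach described in this commit, assuming a hypothetical `ExitTracker` class and warning text (neither is taken from the SDK). It records whether SIGTERM or SIGINT arrived and, if the process exits without either, prints a warning from an `atexit` hook. As in the review fix, the hook is only registered when the signal handlers were installed successfully.

```python
import atexit
import signal
import sys
from typing import Optional


class ExitTracker:
    """Hypothetical helper: warn when the process exits without a signal."""

    def __init__(self) -> None:
        self.received_signal: Optional[int] = None

    def on_signal(self, signum: int, frame: object) -> None:
        self.received_signal = signum
        # Exit via SystemExit so atexit hooks still run after SIGTERM.
        raise SystemExit(128 + signum)

    def exit_warning(self) -> Optional[str]:
        if self.received_signal is not None:
            return None  # stopped by a signal: nothing to warn about
        return (
            "Runner process exited without blocking; no jobs were processed. "
            "Consider running inside a server framework such as uvicorn or Flask."
        )

    def install_signal_handlers(self) -> bool:
        try:
            signal.signal(signal.SIGTERM, self.on_signal)
            signal.signal(signal.SIGINT, self.on_signal)
        except (ValueError, OSError):
            # signal.signal raises ValueError outside the main thread.
            return False
        return True

    def register(self) -> bool:
        # Skip the atexit hook when handlers could not be installed,
        # otherwise every exit would look like a signal-free exit.
        if not self.install_signal_handlers():
            return False
        atexit.register(self._print_warning)
        return True

    def _print_warning(self) -> None:
        message = self.exit_warning()
        if message:
            print(message, file=sys.stderr)
```

In a script this would be used as `ExitTracker().register()` near startup; the sketch does not distinguish a clean exit from one caused by an unhandled exception.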
…_run() to pass project_name to the OpikClient (#5956) * [OPIK-5297] Add `project_name` to optimization creation in `base_optimizer.py` * [OPIK-5297] Add assertion for `project_name` propagation in optimization creation test * [OPIK-5297] Add `source="optimization"` to span-related test cases * [OPIK-5297] Add `project_name` support to optimizer dataset creation and associated tests * [OPIK-5297] Refactor test fixtures to improve `project_name` propagation and environment setup * [OPIK-5297] Refactor `setup_environment` fixture to use `pytest.MonkeyPatch.context` for improved environment variable management * [OPIK-5297] Set `setup_environment` fixture scope to `session` in E2E tests * [OPIK-5297] Set `setup_environment` fixture to `autouse` and fix import paths in E2E tests * [OPIK-5297] Update `setup_driving_hazard_dataset` fixture to use `Generator` for improved type safety * [OPIK-5297] Introduce old dataset cleanup and improve `setup_environment` fixture in E2E tests * [OPIK-5297] Added more detailed logging * [OPIK-5297] Fixed failing unit test * [OPIK-5297] Refactor dataset creation to use cached client and centralize dataset cleanup logic * [OPIK-5297] Improve logging, typing, and validation across dataset utilities and test fixtures * [OPIK-5297] Removed unnecessary client.end() call * [OPIK-5297] Add Opik server setup and health checks to E2E test workflow * [OPIK-5297] Add verbose summary report to pytest output in E2E workflow * [OPIK-5297] Fixed to avoid race conditions during E2E test execution * [OPIK-5297] Extend E2E workflow timeout to 40 minutes
…#6009) * [OPIK-5427] [FE] Prettify agent sandbox output with SyntaxHighlighter Replace raw <pre> dump with existing SyntaxHighlighter component, enabling Pretty (markdown), JSON, and YAML view modes with copy support. * Revision 2: Handle null agent results in SyntaxHighlighter Use undefined as the no-data sentinel so completed jobs with null results still render the SyntaxHighlighter (showing output: null) instead of silently hiding the result section.
#6001) * Status filter for local runners * address comments * address comments
Co-authored-by: aadereiko <aliaksandr@comet.com>
* [NA] [BE] fix: filter spans by trace_id in project metrics queries Span subqueries in GET_COST, GET_COST_WITH_BREAKDOWN, GET_TOKEN_USAGE, and GET_TOKEN_USAGE_WITH_BREAKDOWN were not scoping spans to the traces returned by traces_filtered. Adding AND trace_id IN (SELECT id FROM traces_filtered) ensures spans are only aggregated for traces that pass all applied filters (time range, name, metadata, feedback scores, etc.). Benchmarked on production (1.9M spans, 7-day window): - Granules read: 25,429 → 4,959 (5x reduction) - GET_TOKEN_USAGE latency: ~2.0s → ~0.6s median (3.5x faster) - GET_COST latency: ~1.6s → ~0.9s median (1.7x faster) * [NA] [BE] fix: scope span subqueries to traces_filtered and add created_at index on authored_feedback_scores - Add AND trace_id IN (SELECT id FROM traces_filtered) to span subqueries in GET_COST, GET_COST_WITH_BREAKDOWN, GET_TOKEN_USAGE, GET_TOKEN_USAGE_WITH_BREAKDOWN. Previously filtering by span.id (5th ORDER BY column) caused full-table scans; the fix reduces granules read from 25,429 to 4,959 (~5x) and query latency by ~2x. - Replace inline dateDiff duration expressions with the MATERIALIZED duration column in TRACE_FILTERED_PREFIX, SPAN_FILTERED_PREFIX, and GET_AVERAGE_DURATION. - Remove FINAL from feedback_scores and authored_feedback_scores reads in TRACE_FILTERED_PREFIX, SPAN_FILTERED_PREFIX, and THREAD_FILTERED_PREFIX, replacing deduplication with ROW_NUMBER() window function which is already applied. - Scope traces_final in THREAD_FILTERED_PREFIX to only traces whose thread_id is in the selected time window (was previously loading all threads in the project). - Add minmax skip index on authored_feedback_scores.created_at (migration 000076). * Update and rename 000076_add_minmax_index_authored_feedback_scores_created_at.sql to 000078_add_minmax_index_authored_feedback_scores_created_at.sql
) * [OPIK-5479] [FE] fix: clear cached pair code on runner connection When a runner connects, the backend immediately consumes the pairing code (Redis getAndDelete). However the frontend kept the stale code in React Query cache. If the runner later disconnected, the empty state re-displayed the expired code, causing users to attempt reconnection with a code that would always fail. Fix: use queryClient.removeQueries() to evict the pair code cache as soon as isConnected becomes true. On subsequent disconnection React Query sees no cached data and fetches a fresh code on demand. * [OPIK-5479] [FE] test: add unit tests for pair code cache invalidation Tests verify that: - Empty state shows pair code when disconnected - Connected state renders when runner is connected - Pair code cache is cleared (removeQueries) on connection - Pair code cache is NOT cleared when disconnected * [OPIK-5479] [FE] fix: lint errors in test file (display names) * [OPIK-5479] [FE] fix: typecheck error — use vi.spyOn return type
…ing usage

Extract cache_creation_input_tokens and cache_read_input_tokens from the message_start chunk in ClaudeAggregator.aggregate() and pass them to anthropic_to_bedrock_usage(), so cacheWriteInputTokens and cacheReadInputTokens are correctly tracked in streaming responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
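The change described in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the actual `ClaudeAggregator` code: `aggregate_usage_sketch` and `anthropic_to_bedrock_usage_sketch` are hypothetical names, and the real helper's signature is not shown in this PR. The chunk layout (a `message_start` chunk carrying `message.usage`, later `message_delta` chunks carrying an updated `output_tokens`) follows the Anthropic streaming format.

```python
from typing import Any, Dict, Iterable


def anthropic_to_bedrock_usage_sketch(usage: Dict[str, int]) -> Dict[str, int]:
    """Hypothetical converter from Anthropic usage keys to Bedrock-style keys."""
    return {
        "inputTokens": usage.get("input_tokens", 0),
        "outputTokens": usage.get("output_tokens", 0),
        "cacheWriteInputTokens": usage.get("cache_creation_input_tokens", 0),
        "cacheReadInputTokens": usage.get("cache_read_input_tokens", 0),
    }


def aggregate_usage_sketch(chunks: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    # Initialise every counter before the loop so absent fields default to 0.
    input_tokens = 0
    output_tokens = 0
    cache_creation_input_tokens = 0
    cache_read_input_tokens = 0

    for chunk in chunks:
        chunk_type = chunk.get("type")
        if chunk_type == "message_start":
            usage = chunk.get("message", {}).get("usage", {}) or {}
            input_tokens = usage.get("input_tokens") or 0
            output_tokens = usage.get("output_tokens") or 0
            # The two fields that were previously never read:
            cache_creation_input_tokens = usage.get("cache_creation_input_tokens") or 0
            cache_read_input_tokens = usage.get("cache_read_input_tokens") or 0
        elif chunk_type == "message_delta":
            usage = chunk.get("usage", {}) or {}
            output_tokens = usage.get("output_tokens") or output_tokens

    return anthropic_to_bedrock_usage_sketch(
        {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cache_creation_input_tokens": cache_creation_input_tokens,
            "cache_read_input_tokens": cache_read_input_tokens,
        }
    )
```

Using `or 0` rather than a plain default also covers the case where a provider sends an explicit null for a cache field.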
Python SDK Unit Tests Results (Python 3.11): 1 tests, 0 ✅, 7s ⏱️. For more details on these errors, see this check. Results for commit a4953bc.
Python SDK Unit Tests Results (Python 3.12): 1 tests, 0 ✅, 7s ⏱️. For more details on these errors, see this check. Results for commit a4953bc.
Python SDK Unit Tests Results (Python 3.14): 1 tests, 0 ✅, 4s ⏱️. For more details on these errors, see this check. Results for commit a4953bc.
Python SDK Unit Tests Results (Python 3.13): 1 tests, 0 ✅, 8s ⏱️. For more details on these errors, see this check. Results for commit a4953bc.
Python SDK Unit Tests Results (Python 3.10): 1 tests, 0 ✅, 8s ⏱️. For more details on these errors, see this check. Results for commit a4953bc.
Details
In `ClaudeAggregator.aggregate()`, the `message_start` chunk handler only extracted `input_tokens` and `output_tokens` from the usage dict. The cache token fields (`cache_creation_input_tokens`, `cache_read_input_tokens`) were present in the Anthropic streaming response but were never read, so `anthropic_to_bedrock_usage()` always received zeros for them, causing `cacheWriteInputTokens` and `cacheReadInputTokens` to be permanently 0 in the logged span usage.

The fix initializes both cache token variables before the chunk loop, extracts them from the `message_start` chunk's `message.usage` dict, and includes them in the call to `anthropic_to_bedrock_usage()`. The non-streaming path was already correct (it passes the full usage dict from the response body).

Change checklist
Issues
Testing
New unit tests in `tests/unit/integrations/bedrock/test_claude_aggregator.py`:
- `test_claude_aggregator__cache_tokens_in_message_start__included_in_usage`: verifies that `cacheWriteInputTokens` and `cacheReadInputTokens` are populated from the `message_start` chunk when cache tokens are present.
- `test_claude_aggregator__no_cache_tokens__defaults_to_zero`: verifies the zero-default behavior is preserved when cache tokens are absent.

To run:
    cd sdks/python
    python -m pytest tests/unit/integrations/bedrock/test_claude_aggregator.py -v

Documentation
No documentation updates needed.